(54) Error recovery

(57) The present invention provides a method and
apparatus for error recovery in a system. The apparatus
comprises a directory cache adapted to store at least
one entry and a control unit. The control unit is adapted
to determine if at least one uncorrectable error exists in
the directory cache and to place the directory cache
offline in response to determining that the error is
uncorrectable. The method comprises detecting an error in
data stored in a storage device in the system, and
determining if the detected error is correctable. The
method further comprises making at least a portion of the
storage device unavailable to one or more resources in
the system in response to determining that the error is
uncorrectable.
Description

BACKGROUND OF THE INVENTION 

1. FIELD OF THE INVENTION

[0001] This invention relates generally to processor-based systems, and, more particularly, to error recovery in a
directory cache of a distributed, shared-memory processor-based system.

2. DESCRIPTION OF THE RELATED ART

[0002] Businesses typically rely on network computing to maintain a competitive advantage over other businesses.
As such, developers, when designing processor-based systems for use in network-centric environments, may take
several factors into consideration to meet the expectations of the customers, factors such as functionality, reliability,
scalability, and performance of such systems.

[0003] One example of a processor-based system used in a network-centric environment is a mid-range server 
system. A single mid-range server system may have a plurality of system boards that may, for example, be configured 
as one or more domains, where a domain, for example, may act as a separate machine by running its own instance 
of an operating system to perform one or more of the configured tasks.

[0004] A mid-range server, in one embodiment, may employ a distributed shared memory system, where processors
from one system board can access memory contents from another system board. The union of all of the memories on 
the system boards of the mid-range server comprises a distributed shared memory (DSM). 

[0005] One method of accessing data from other system boards within a system is to broadcast a memory request 
on a common bus. For example, if a requesting system board desires to access information stored in a memory line 
residing in a memory of another system board, the requesting system board typically broadcasts on the common bus
its memory access request. All of the system boards in the system may receive the same request, and the system 
board whose memory address ranges match the memory address provided in the memory access request may then 
respond. 

[0006] The broadcast approach for accessing contents of memories in other system boards may work adequately
when a relatively small number of system boards are present in a system. However, such an approach may be
unsuitable as the number of system boards grows. As the number of system boards grows, so does the number of memory
access requests; thus, to handle this increased traffic, larger and faster buses may be needed to allow the memory
accesses to complete in a timely manner. Operating a large bus at high speeds may be problematic because of electrical
concerns, in part, due to high capacitance, inductance, and the like. Furthermore, a larger number of boards within a
system may require extra broadcasts, which c
power to handle the extra broadcasts. 

[0007] Designers have proposed the use of directory caches in distributed shared memory systems to reduce the
need for globally broadcasting memory requests. Typically, each system board serves as a home board for memory lines
within a selected memory address range, and each system board is aware of the memory address ranges
belonging to the other system boards within the system. Each home board generally maintains its own directory cache
for memory lines that fall within its address range. Thus, when a requesting board desires to access memory contents
from another board, instead of generally broadcasting the memory request in the system, the request is transmitted
to the appropriate home board. The home board may consult its directory cache and determine which system board 
is capable of responding to the memory request. 
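
The difference between broadcasting and routing to a home board can be sketched as follows. This is a minimal illustration only: the address ranges, board numbers, and the names `ADDRESS_RANGES`, `home_board_for`, and `route_request` are hypothetical and do not come from the specification.

```python
from typing import List, Optional

# Hypothetical address map: (first_address, last_address, home_board_id).
ADDRESS_RANGES = [
    (0x0000_0000, 0x3FFF_FFFF, 0),
    (0x4000_0000, 0x7FFF_FFFF, 1),
    (0x8000_0000, 0xBFFF_FFFF, 2),
]


def home_board_for(address: int) -> Optional[int]:
    """Return the board whose address range contains `address`, or None."""
    for first, last, board in ADDRESS_RANGES:
        if first <= address <= last:
            return board
    return None


def route_request(address: int, requester: int) -> List[int]:
    """Return the boards that must receive the memory request.

    When the home board is known, only that board is contacted.
    Otherwise the request falls back to a broadcast to every other board.
    """
    home = home_board_for(address)
    if home is not None:
        return [home]
    all_boards = sorted({board for _, _, board in ADDRESS_RANGES})
    return [b for b in all_boards if b != requester]
```

With three boards the saving is small, but the number of recipients of a routed request stays at one as the board count grows, whereas a broadcast reaches every other board.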

[0008] Directory caches are generally effective in reducing the need for globally broadcasting memory requests
during memory accesses. However, as would be expected, the effectiveness of the directory caches depends in part
on the directory caches being properly operational while the system is running. An inoperable or a partially inoperable
(i.e., functioning but with one or more errors) directory cache may sometimes go undetected for extended periods of
time, and may thereby adversely affect the overall operation of the system.


SUMMARY OF THE INVENTION 

[0009] In one aspect of the instant invention, an apparatus is provided for error recovery in a system. The apparatus
comprises a directory cache adapted to store at least one entry and a control unit. The control unit is adapted to
determine if at least one uncorrectable error exists in the directory cache and to place the directory cache offline in
response to determining that the error is uncorrectable.

[0010] In another aspect of the present invention, a method is provided for error recovery in a system. The method
comprises detecting an error in data stored in a storage device in the system, and determining if the detected error is
correctable. The method further comprises making at least a portion of the storage device unavailable to one or more
resources in the system in response to determining that the error is uncorrectable.

[0011] In yet another aspect of the instant invention, an article comprising one or more machine-readable storage
media containing instructions is provided for error recovery. The instructions, when executed, may enable a processor
to determine a multiple-bit error in data stored in a storage device of a domain and to isolate at least a portion of the
storage device from one or more resources in the domain while the domain is active, in response to determining the
multiple-bit error.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] The invention may be understood by reference to the following description taken in conjunction with the
accompanying drawings, in which like reference numerals identify like elements, and in which:

Figure 1 shows a stylized block diagram of a system in accordance with one embodiment of the present invention;

Figure 2 illustrates a block diagram of an exemplary domain configuration that may be employed in the system of
Figure 1, in accordance with one embodiment of the present invention;

Figure 3 depicts a stylized block diagram of one system board set that may be employed in the system of Figure
1, in accordance with one embodiment of the present invention;

accordance with one embodiment of the present invention; 

Figure 5 illustrates a flow diagram of a method for storing an entry in a directory cache of the system of Figure 1, in
accordance with one embodiment of the present invention; and

Figures 6A-B illustrate a flow diagram of a method for error recovery in a directory cache of the system of Figure
1, in accordance with one embodiment of the present invention.

[0013] While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof
have been shown by way of example in the drawings and are herein described in detail. It should be understood,
however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms
disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the
spirit and scope of the invention as defined by the appended claims.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

[0014] Illustrative embodiments of the invention are described below. In the interest of clarity, not all features of an
actual implementation are described in this specification. It will of course be appreciated that in the development of
any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers'
specific goals, such as compliance with system-related and business-related constraints, which will vary from one
implementation to another. Moreover, it will be appreciated that such a development effort might be complex and
time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit
of this disclosure.

[0015] As will be described in more detail below, in accordance with one or more embodiments of the present
invention, an inoperable or partially inoperable directory cache may be [...]

[0016] Referring now to Figure 1, a block diagram of a system 10 in accordance with one embodiment of the present
invention is illustrated. The system 10, in one embodiment, includes a plurality of system control boards 15(1-2) that
are coupled to a switch 20. For illustrative purposes, lines 21(1-2) are [...]
in any of a variety of ways, including by edge connectors, cables, or other available [...]

[0017] In the illustrated embodiment, the system 10 includes two control boards 15(1-2), one for managing the overall
operation of the system 10 and the other to provide redundancy and automatic failover in [...]. Although not so limited,
in the illustrated embodiment, [...] system control board, while the second [...] control board. In one embodiment,
during any given moment, [...] boards 15(1-2) actively controls the overall operations of the system 10.

[0018] If failures of the hardware or software occur on the main system control board 15(1), or failures on [...]
control path from the main system control board 15(1) to other system [...], the system controller failover
software 22 automatically triggers a failover to the alternative control board 15(2). [...] in the illustrated embodiment, may
each include a respective control unit 23(1-2).

[0019] The system 10, in one embodiment, includes a plurality of system board sets 29(1-n) [...]

[0020] The switch 20, in one embodiment, may be an 18x18 crossbar switch that allows system board sets 29(1-n)
and system control boards 15(1-2) to communicate, if desired. Thus, the switch 20 may allow the two system control
boards 15(1-2) to communicate with each other or with other system board sets 29(1-n), as well as allow the system
board sets 29(1-n) to communicate with each other.

[0021] The system board sets 29(1-n), in one embodiment, comprise one or more boards, including a system board
30, I/O board 35, and expander board 40. The system board 30 may include processors and [...]

[0022] In one embodiment, the system 10 may be dynamically subdivided into a plurality of system domains, where
each domain may have a separate boot disk (to execute a specific instance of the operating system, for example),
separate disk storage, network interfaces, and/or I/O interfaces. Each [...]
machine that performs a variety of user-configured services. For example, one or more [...]
as an application server, a web server, database server, and the like. In one embodiment, each domain may [...]

[0023] [...] exemplary arrangement where at least two domains are defined in the system 10. The
first domain, identified by vertical cross-sectional lines, includes the system board set 29(n/2 + 2), the system board 30
of the system board set 29(1), and the I/O board 35 of the system board set [...]. The [...]
board set 29(1) and the system board 30 of the system board set 29(2).

[0024] A domain may be formed of an entire system board set 29(1-n), one or more boards (e.g., system
board 30, I/O board 35) from [...] system board sets 29(1-n), or a combination thereof. Although [...]
conceivably have up to "n" (i.e., the number of system board sets)
different domains. When two boards (e.g., system board 30, I/O board 35) from the same system board set 29(1-n) [...]

[0025] Using the switch 20, inter-domain communications may be possible. For example, the switch 20 may provide

3 T* - ^ ,0f ^ 3nd ^ ,hr ° U9h *• Switth 20 be used for inter- 

[0026] Referring now to Figure 3, a block diagram of the system board set 29(1-n) coupled to the switch 20 is
illustrated. [...] includes four processors 360(1-4), with each of the processors 360(1-4) having
an associated memory 361(1-4). In one embodiment, each of the processors 360(1-4) may be coupled to a [...]
cache memory 362(1-4). In other embodiments, each of the processors 360(1-4) may have more than one associated
cache memory, wherein some or all of the one or more cache memories may reside within the processors 360(1-4).
In one embodiment, each cache memory 362(1-4) may be a split cache, where a storage portion of the cache memory
[...] the processors, and a control portion (e.g., tags and flags) may be resident [...]

[0027] [...] 362(1-4), as well as access the memories associated with other processors. In one embodiment, a different number
of processors and memories may be employed in any desirable combination, depending on the [...]
[...] processors 360(1-2)/memories 361(1-2) and processors 360(3-4)/memories 361(3-4)) to a board data switch 367.

[0028] Although not so limited, the I/O board 35 of each system board set 29(1-n) in the illustrated embodiment
includes a controller 370 for managing one or more of the [...]

[0029] The [...] controllers 370, 374 on the I/O board 35, in one embodiment, are coupled to a data switch 378. A
switch 380 in the expander board 40 receives the output signal from the switch 378 of the I/O board 35 and [...]
The SDI 383 may process data transactions to and from the switch 20 and the system and I/O boards 30 and 35. A
separate address path (shown in dashed lines) is shown from the [...]
[...] expander [...] module 382. In the illustrated embodiment, the SDI 383 [...] 384 [...]
board 40 of the system board sets 29(1-n). Thus, in one embodiment, the switch [...]
switches that provide a separate data path, address path, and control signal path to [...]
[...] control signal traffic. In one embodiment, the switch 20 may provide a bandwidth of
about 43 gigabytes per second. In other embodiments, a higher or lower bandwidth may be achieved using the switch [...]

[0031] It should be noted that the arrangement and/or location of various components (e.g., AXQ module 382,
processors 360(1-4), controllers 370, 374) within each system board set 29(1-n) is a matter of design choice, and thus may
vary from one implementation to another. Additional, more, or fewer components may be employed without deviating
from the scope of the present invention.

[0032] Cache coherency may be performed at two different levels, one at the intra-system board set 29(1-n) level
and one at the inter-system board set 29(1-n) level. With respect to the first level, cache coherency within each system
board set 29(1-n) is performed, in one embodiment, using conventional cache coherency snooping techniques, such
as the modified, exclusive, shared, and invalid (MESI) cache coherency protocol. As such, the processors [...]

tzr^^^^ 

[0033] Because the number of devices within the system board set 29(1-n) may be relatively small, a conventional
[...] effective. However, because the system 10 may contain a large number of system board sets 29(1-n),
each having one or more processors, memory accesses may require a large number of broadcasts before such requests
[...] expander board 40 using, in one embodiment, the scalable shared memory
(SSM) protocol.

[0034] The AXQ module 382, in one embodiment, includes a control unit 389 coupled to a home agent 390, a request
agent 392, and a slave agent 394. Collectively, the agents 390, 392, 394 may operate to aid in maintaining [...]
[...] the control unit 389 of the AXQ module 382 [...]
[...] as well as interconnects the home agent 390, request agent 392, slave agent 394 within
the AXQ module 382. In one embodiment, if the expander board 40 is split between two domains (i.e., the system and
the I/O boards 30 and 35 of one system board set 29(1-n) are in different domains), the control unit 389 of the AXQ
module 382 may arbitrate the system board 30 and I/O board 35 separately, one on odd cycles, and the other on even [...]

[0035] The SSM protocol uses MTags embedded in the data to control what the devices under the control of each
[...] The MTags may be stored in the caches 362(1-4) of each system board set
29(1-n). Table 1 below illustrates three types of values that may be associated with MTags.
TABLE 1:

MTag Type          Description
Invalid (gI)
Shared (gS)        A read may complete, but not a write.
Modifiable (gM)    Both reads and writes are permitted to this line.
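
The permissions in Table 1 can be expressed as a small lookup. The sketch below is illustrative only: the names `MTag` and `access_allowed` are hypothetical, and because the description of the invalid state is not legible in the source, the sketch assumes that an invalid line permits neither reads nor writes.

```python
from enum import Enum


class MTag(Enum):
    INVALID = "gI"
    SHARED = "gS"
    MODIFIABLE = "gM"


def access_allowed(tag: MTag, is_write: bool) -> bool:
    """Return True if the access may complete locally under the given MTag."""
    if tag is MTag.MODIFIABLE:
        return True              # reads and writes are both permitted
    if tag is MTag.SHARED:
        return not is_write      # a read may complete, but not a write
    return False                 # assumption: gI permits no local access
```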






[0036] As mentioned, the MTag states are employed in the illustrated embodiment in addition to the conventional [...]
board 40. If the line is not gM, then a remote transaction may have to be done involving the cache AXQ module 382,
which, as mentioned, employs the SSM protocol in one embodiment.

[0037] The AXQ module 382, in one embodiment, controls a directory cache (DC) 396 that holds information about
lines of memory that have been recently referenced using the SSM protocol. The DC 396 [...]
be stored in a volatile memory, such as a static random access memory (SRAM). The DC 396 may [...]
board 40. The AXQ module 382 controls a locking module 398 that prevents access to a [...] directory
cache 396 when the status of that entry, for example, is [...]

[0038] The DC 396 may be capable of caching a predefined number of directory entries corresponding to [...]
(1-4) for a given system board 30. The DC 396 [...]
[...] used memory blocks may generally be cached. Although not so limited, in the
illustrated embodiment, the DC 396 is a 3-way set-associative cache, formed of three SRAMs that can be [...]
[...] in a given set 410 may be indexed by a hash
of the address.
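
A 3-way set-associative lookup indexed by a hash of the address can be sketched as follows. The specification does not give the hash function, the number of sets, or the line size, so `NUM_SETS`, `LINE_BYTES`, `set_index`, and the class name are all hypothetical choices made for illustration.

```python
NUM_WAYS = 3        # three entry fields per set, as in the illustrated embodiment
NUM_SETS = 1024     # hypothetical number of sets
LINE_BYTES = 64     # hypothetical cache line size


def set_index(address: int) -> int:
    """Map an address to a set using a simple, hypothetical XOR-fold hash."""
    line = address // LINE_BYTES
    return (line ^ (line >> 10)) % NUM_SETS


def tag_of(address: int) -> int:
    """Use the full line number as the tag to keep the sketch simple."""
    return address // LINE_BYTES


class DirectoryCacheSketch:
    def __init__(self):
        # Each set holds NUM_WAYS slots; a slot is None or (tag, payload).
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def store(self, address: int, way: int, payload) -> None:
        self.sets[set_index(address)][way] = (tag_of(address), payload)

    def lookup(self, address: int):
        """Return (way, payload) on a hit, or None on a miss."""
        for way, slot in enumerate(self.sets[set_index(address)]):
            if slot is not None and slot[0] == tag_of(address):
                return way, slot[1]
        return None
```

In hardware the three ways would be read in parallel from the three SRAMs; the loop above is only the sequential equivalent of that comparison.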

[0039] As shown in Figure 4A, in one embodiment, each of the three DC entry fields 415(0-2) has an associated
address parity field 420(0-2). The address parity field 420(0-2), in the illustrated embodiment, is a [...] that
enables error detection in the address being sent from the AXQ module 382 to the directory cache 396. An error in the
address being sent from the AXQ module 382 to the directory cache 396 may occur for one of a variety
of reasons, including because of a faulty wiring connection between the AXQ module 382 and the directory cache 396,
a faulty pin on either the AXQ module 382 or directory cache 396, and the like.
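
Address parity checking of this kind can be sketched in a few lines. The specification does not state whether even or odd parity is used, so the sketch assumes even parity; the function names are hypothetical.

```python
def address_parity(address: int) -> int:
    """Even-parity bit: 1 when the address contains an odd number of 1 bits."""
    return bin(address).count("1") & 1


def address_ok(address_seen: int, stored_parity: int) -> bool:
    """Check the address seen on a read against the parity stored on the write."""
    return address_parity(address_seen) == stored_parity
```

A single flipped address line, such as one caused by a faulty pin, changes the parity and is detected; two flipped lines cancel out and are not.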

[0040] The set 410 of the directory cache includes two error correction code (ECC) fields 425(0-1), each being [...]
[...] number of bit errors. For instance, in one embodiment, the contents of the
ECC field 425(0-1) may be utilized to detect and correct a single-bit error.
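
The specification does not name the ECC scheme. As one well-known example of a code that corrects single-bit errors and detects (but cannot correct) double-bit errors, the sketch below uses an extended Hamming (8,4) code over a 4-bit payload. The payload width and function names are assumptions for illustration; a real directory entry would protect far more bits.

```python
def encode_nibble(data: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(data >> i) & 1 for i in range(4)]
    bits = [0] * 8                    # bits[1..7]: Hamming positions, bits[0]: overall parity
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1       # makes the total number of 1 bits even
    return sum(b << i for i, b in enumerate(bits))


def decode_nibble(word: int):
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos           # XOR of set positions is 0 for a valid codeword
    overall = sum(bits) & 1           # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1           # single-bit error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips with a nonzero syndrome
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return status, data
```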

[0041] Each 3-wide DC entry 410 includes a least recently modified (LRM) field 430 that may identify which of the
entry fields 415(0-2) was least recently modified. Although other encoding techniques may be employed, the encoding in
the illustrated embodiment is provided in Table 2 below.



TABLE 2: DC Least-Recently-Modified encoding

LRM    Most Recent    Middle     Least Recent
000    Entry 0        Entry 1    Entry 2
001    Entry 1        Entry 0    Entry 2
010    Entry 2        Entry 0    Entry 1
011    *** undefined state ***
100    Entry 0        Entry 2    Entry 1
101    Entry 1        Entry 2    Entry 0
110    Entry 2        Entry 1    Entry 0
111    *** undefined state ***
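
The encoding in Table 2 can be held in a lookup table. The sketch below simply transcribes the table; the names `LRM_ORDER` and `least_recently_modified` are hypothetical, and raising an error for the two undefined states is an assumption about how such states might be handled.

```python
# LRM value -> (most recent, middle, least recent) entry numbers, from Table 2.
LRM_ORDER = {
    0b000: (0, 1, 2),
    0b001: (1, 0, 2),
    0b010: (2, 0, 1),
    0b100: (0, 2, 1),
    0b101: (1, 2, 0),
    0b110: (2, 1, 0),
}


def least_recently_modified(lrm: int) -> int:
    """Return the entry number that was least recently modified."""
    if lrm not in LRM_ORDER:
        raise ValueError(f"undefined LRM state: {lrm:03b}")
    return LRM_ORDER[lrm][2]
```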



[...] As an example, the digits "000" (i.e., the first entry in Table 2) indicate that [...] modified.

[...] In one embodiment, two different types of entries, a shared entry 435 and an owned entry 437, may be [...]
in the entry fields 415(0-2) of the DC 396, as shown in Figures 4B-C. An owned entry 437 [...]
that a particular expander board 40 has both read and write [...]

[...] shared entry 435, in one embodiment, includes an identifier field 440, a mask field 445, and an address
tag field 450. The identifier field 440, in the illustrated embodiment, is a single bit field 440 which, if equal to [...],
indicates that the stored cache line is shared by one or more of the processors 360(1-4) of the system [...]
sets 29(1-n)), identifies through a series of bits which of the system boards 30 of the system board sets 29(1-n) has
a shared copy of the cache line. The address tag field 450 may store at least a portion of the address field of the
corresponding cache line, in one embodiment.

[0046] The owned entry 437 includes an identifier field 455, a valid field [...]
[...]. The identifier field 455, in the illustrated embodiment, is a single
bit field 455 which, if equal to bit 0, indicates that the stored cache line is owned by one of the processors [...] of
the system board sets 29(1-n) in the system 10. The owner field 460 is adapted to store the identity of [...]
[...] at least an identifying portion of the address field of the corresponding cache line, in one
embodiment. [...] the upper [...] bits of the address. The valid field
[...] indicates if the corresponding entry in the DC 396 is valid. An entry in the DC 396 may be [...]
as invalid at start-up, for example, when the system 10 or domain in the system 10 is first initialized.
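
The two entry types can be modeled as simple records. The sketch below keeps only the fields that are legible in the text above; the class names, the bit ordering of the mask (bit i standing for board set i), and the default values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SharedEntry:
    mask: int          # assumption: bit i set means board set i holds a shared copy
    address_tag: int

    def sharers(self) -> List[int]:
        """List the board sets whose mask bit is set."""
        return [i for i in range(self.mask.bit_length()) if (self.mask >> i) & 1]


@dataclass
class OwnedEntry:
    owner: int         # identity stored in the owner field
    address_tag: int
    valid: bool = True # entries may be marked invalid at start-up
```

For example, `SharedEntry(mask=0b1010, address_tag=0x12).sharers()` lists board sets 1 and 3.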

[0047] Referring now to Figure 5, a flow diagram of a method of storing data in each set 410 (see Figure 4A) of the
directory cache 396 (see Figure 3) is illustrated, in accordance with one embodiment of the present invention. The
AXQ module 382 (see Figure 3) receives (at 510) data to store in the entry 415(0-2) of the [...]
[...] the information stored in one or more of the entry fields 415(0-2) [...]

[0048] The AXQ module 382 calculates (at 520) one or more address parity bits for the data received (at 510) based
on the directory cache address of where that data is to be stored. For example, based on the location [...]
[...] stored [...] directory cache 396, the AXQ module 382 calculates the appropriate address
parity bits for storage in the address parity fields 420(0-2) (see Figure 4A) of the set 410. In [...]
415(0-2).

[0049] The AXQ module 382 in the illustrated embodiment calculates (at 530) an ECC value based on the data
received (at 510) and the one or more address parity bits calculated (at 520) earlier. In one embodiment, the [...]
[...] ECC value, with four [...] being stored in [...] ECC field [...]

[0050] The AXQ module 382, upon calculating (at 530) the ECC value, stores (at 540) the data in the fields 415(0-2)
of the directory [...]

[0051] Referring now to Figures 6A-B, a method of [...] errors in the directory
cache 396 is illustrated [...]. The AXQ module 382 accesses
(at 605) one or more stored entries in the set 410 of the directory cache 396. In one embodiment, accessing (at 605)
the stored entries may include reading the contents of the various fields of the set 410 [...]
parity bits, ECC value, and the data stored in the entry fields 415(0-2). [...] embodiment, the contents of the entire
set 410 may be accessed (at 605). It should be [...] that [...] may not have valid
data stored therein, and may thus simply contain [...] thereof.

[0052] The AXQ module 382 calculates (at 610) an ECC value [...] for the one or more entries
accessed (at 605). Thus, in one embodiment, [...] the ECC value based on the [...]
the entry fields 415(0-2), LRM [...] address [...]

[0053] The AXQ module 382 compares (at 615) the calculated (at 610) ECC value [...]
ECC fields 425(0-1). Based on the comparison (at 615) [...], the AXQ module 382 determines (at 620) [...]
one or more errors exist in the contents accessed (at 605) [...]

SUr^r^ 

[...] for example, if the calculated (at 610) address bit [...]
which the entries were accessed (at 605) by the [...]
bit error exists, then the AXQ module 382 may [...]

[...] the error detection and [...]
[...] correctable (i.e., an 8-bit ECC value may be used to correct a
single-bit error, but not a two-bit error). Assuming it is determined (at 640) that the error is correctable, then the [...]

[0056] If it is determined (at 625) that an address parity bit error exists or if it is determined (at 635) that the error is
not correctable, then the AXQ module 382 invalidates (at 650 - see Figure 6B) the contents accessed (at 605) [...]
the directory cache 396 and indicating a cache miss to the device requesting the information.
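
The decision described above can be condensed into a small function. This is a sketch only: the function name, its string results, and the use of an error-bit count as input are hypothetical, and the single-bit threshold reflects the embodiment in which the ECC value corrects one bit but not two.

```python
def recovery_action(address_parity_ok: bool, ecc_error_bits: int) -> str:
    """Choose what to do with the contents read from one directory cache set."""
    if not address_parity_ok:
        return "invalidate_and_offline"   # address parity errors are not correctable
    if ecc_error_bits == 0:
        return "use"                      # no error detected
    if ecc_error_bits == 1:
        return "correct_and_use"          # single-bit error: correctable
    return "invalidate_and_offline"       # multi-bit error: uncorrectable
```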

[0057] The AXQ module 382 places (at 655) the directory cache 396 offline since the directory cache 396 may be
inoperable or partially inoperable for one of a variety of reasons. The directory cache 396, in one embodiment, may [...]
While the performance of a domain may adversely be affected while the directory cache 396 is offline, the domain, in
one embodiment, may nevertheless continue to operate. Once the directory cache 396 is placed (at 655) off-line, all
subsequent accesses to the directory cache 396 may be treated as misses. In one embodiment, the directory cache
396 may be placed (at 655) off-line dynamically, while the domain or the expansion board 40 (see Figure 3) to which
the directory cache 396 belongs is active or in operation.
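
Treating every access as a miss while the cache is offline can be sketched as follows. The class and method names are hypothetical. Discarding the contents when going offline and ignoring stores while offline are assumptions of this sketch, chosen so that no stale entry can be returned after the cache is brought back online.

```python
class OfflineCapableCache:
    def __init__(self):
        self._entries = {}
        self.online = True

    def store(self, address: int, value) -> None:
        if self.online:                   # assumption: stores are dropped while offline
            self._entries[address] = value

    def lookup(self, address: int):
        """Return the cached value, or None to signal a miss."""
        if not self.online:
            return None                   # every access is a miss while offline
        return self._entries.get(address)

    def place_offline(self) -> None:
        self.online = False
        self._entries.clear()             # assumption: contents are no longer trusted

    def place_online(self) -> None:
        self.online = True
```

Because a miss is already a normal outcome for any cache, requesters need no special handling for the offline case; they simply take the slower path.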

[0058] In the illustrated embodiment, the AXQ module 382 provides (at 660) the error information to the system
control board 15(1-2) (see Figure 1). The error information may include information indicating the source of the failure.
For example, in one embodiment, the AXQ module 382 may indicate that an address parity bit failed for a selected
portion of the directory cache 396, thereby indicating that address-related problems may be present. As an additional
example, the AXQ module 382 may indicate that an uncorrectable error exists in a selected portion of the directory
cache 396, thereby signifying a problem with the storage RAMs, for example.
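
Error information of this kind can be modeled as a small record that pairs the detected error with its likely source. The type names, the field `set_index`, and the message format are hypothetical; only the two error kinds and their suggested causes come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    ADDRESS_PARITY = "address parity bit failed"
    UNCORRECTABLE_ECC = "uncorrectable ECC error"


# Likely source of each failure, following the two examples in the text.
LIKELY_CAUSE = {
    ErrorKind.ADDRESS_PARITY: "address-related problem (e.g. wiring or pin fault)",
    ErrorKind.UNCORRECTABLE_ECC: "storage RAM problem",
}


@dataclass(frozen=True)
class ErrorReport:
    kind: ErrorKind
    set_index: int     # hypothetical: which portion of the directory cache failed

    def describe(self) -> str:
        return f"set {self.set_index}: {self.kind.value}; suspect {LIKELY_CAUSE[self.kind]}"
```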

[0059] The system control board 15(1-2) may perform (at 670) diagnostics on the directory cache 396 based on the
error information provided (at 660) by the AXQ module 382. The diagnostic tests may aid in solving one or more
[...] The directory cache 396 may be [...] (at 675) based on the diagnostics [...]
[...] instances, the system control board 15(1-2) may be [...] to resolve an identified problem. In
[...], the system control board 15(1-2) may indicate to a system administrator the nature of the problem
[...] system administrator may then resolve any outstanding problems with the directory [...]

[0060] In one embodiment, the directory cache 396 may be tested (at 670) and serviced (at 675) dynamically while
[...]

[0061] The AXQ module 382 may place (at 680) the directory cache 396 [...] the directory cache 396 has
[...] In one embodiment, the directory cache 396 may be brought online by resetting a bit in the
[...] is indicative of whether the directory cache 396 is [...] the other resources [...]
the domain in the system 10. In one embodiment, the directory cache 396 may be placed (at 680) on-line dynamically
while the domain or the expansion board 40 (see Figure 3) to which the directory cache 396 belongs is active or in
operation.
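
Controlling availability through a single bit can be sketched with plain bit operations. The register layout is not given in the specification, so the bit position `DC_OFFLINE_BIT` and the function names are hypothetical; the sketch follows the text in treating a reset (cleared) bit as online.

```python
DC_OFFLINE_BIT = 1 << 0   # hypothetical bit position within a control register


def place_offline(register: int) -> int:
    """Set the bit: the directory cache becomes unavailable."""
    return register | DC_OFFLINE_BIT


def place_online(register: int) -> int:
    """Reset the bit: the directory cache becomes available again."""
    return register & ~DC_OFFLINE_BIT


def is_online(register: int) -> bool:
    return (register & DC_OFFLINE_BIT) == 0
```

Since only one bit changes, the other bits of the register are left untouched, which is what allows the change to be made while the domain remains active.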

[0062] For ease of illustration, several references to "cache line" or "cache lines" are made in the discussion herein 
with respect to memory access. It should be appreciated that a "cache line," as utilized in this discussion, may include 
one or more bits of information that is retrieved from the caches 362(1-4) (see Figure 3) in the system 10.

[0063] While one or more embodiments have been described herein in the context of the directory cache 396 (see
Figure 3), it should be appreciated that one or more embodiments of the present invention may also be applicable to
other storage devices, including a main memory, a cache, a hard drive, and the like. 

[0064] The various system layers, routines, or modules may be executable control units (such as control unit 389
(see Figure 3)). Each control unit may include a microprocessor, a microcontroller, a digital signal processor, a processor
card (including one or more microprocessors or controllers), or other control or computing devices.

[0065] The storage devices referred to in this discussion may include one or more machine-readable storage media
for storing data and instructions. The storage media may include different forms of memory including semiconductor
memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable
read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash
memories; magnetic disks such as fixed, floppy, removable disks; other magnetic media including tape; and optical
media such as compact disks (CDs) or digital video disks (DVDs). Instructions that make up the various software layers,
routines, or modules in the various systems may be stored in respective storage devices. The instructions when
executed by a respective control unit cause the corresponding system to perform programmed acts.

[0066] The instructions can be provided as one or more computer programs, routines, modules, software layers, etc.
on one or more carrier media. Suitable carrier media include a storage medium such as, by way of example only, 
optical, magneto optical, magnetic, solid state, tape or disk storage media, or a transmission medium such as, by way 
of example only, wired, wireless, optical or electromagnetic media forming part, for example, of a network, point to 
point, or broadcast communications medium. 

[0067] The particular embodiments disclosed above are illustrative only, as the invention may be modified and
practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein.
Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described 
in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified 
and all such variations are considered within the scope of the invention. 


Claims 

1. A method, comprising:

detecting an error in data stored in a storage device in a system;
determining if the detected error is correctable; and

making at least a portion of the storage device unavailable to one or more resources in the system in response
to determining that the error is uncorrectable.

2. The method of claim 1, wherein detecting the error comprises detecting the error in the data using error correction
code. 

3. The method of claim 2, wherein determining if the detected error is correctable comprises determining that the 
detected error is a multi-bit error.

4. The method of any preceding claim, wherein determining if the detected error is correctable comprises determining 
that the detected error is an address parity error. 

5. The method of any preceding claim, wherein making at least the portion of the storage device unavailable comprises
making at least the portion of the storage device unavailable while the system is in operation.

6. The method of any preceding claim, further comprising testing the storage device based on determining that the 
error is uncorrectable. 


7. The method of claim 6, further comprising servicing the storage device in response to testing the storage device.

8. The method of claim 7, further comprising dynamically allowing access to the storage unit in response to servicing
the storage device.

9. The method of any preceding claim, wherein the storage device includes a directory cache, and wherein making
at least the portion of the storage device unavailable comprises generating a cache miss in response to a request
to access the directory cache.

10. An apparatus, comprising:

a directory cache adapted to store at least one entry; and
a control unit adapted to:

determine if at least one uncorrectable error exists in the directory cache; and 

place the directory cache offline in response to determining that the error is uncorrectable. 

11. The apparatus of claim 10, wherein the directory cache is a three-way associative directory cache.

12. The apparatus of claim 10 or claim 11, wherein the control unit determines if the entry contains a multi-bit error.

13. The apparatus of claim 12, wherein the entry is an address bit entry, and wherein the control unit determines if the
address parity bit entry contains an error.

14. The apparatus of any of claims 10 to 13, wherein the directory cache is associated with a domain, and wherein
the control unit places the directory cache offline while the domain is active.

15. The apparatus of claim 14, wherein the control unit provides a cache miss to a device requesting to access the
directory cache while the directory cache is offline.

16. The apparatus of claim 14, wherein the control unit tests the directory cache in response to determining that the
error is uncorrectable.

17. The apparatus of claim 15, wherein the control unit causes the directory cache to be serviced in response to testing
the directory cache.

18. The apparatus of claim 15, wherein the control unit places the directory cache online in response to causing the
directory cache to be serviced.

19. The apparatus of claim 1fi, wherein the control unit places the directory cache online dynamically.

20. An article comprising instructions that when executed enable a processor to: 


determine a multiple-bit error in data stored in a storage device of a domain; and 

isolate at least a portion of the storage device from one or more resources in the domain while the domain is
active, in response to determining the multiple-bit error.

21. The article of claim 20, wherein the instructions when executed enable the processor to perform an ECC error
check to determine the multiple-bit error in the data.

22. The article of claim 20 or claim 21, wherein the instructions when executed enable the processor to dynamically
test the storage device in response to isolating the storage device. 


23. The article of any of claims 20 to 22, wherein the instructions when executed enable the processor to dynamically 
restore the storage device in the domain. 

24. The article of any of claims 20 to 23, wherein the instructions when executed enable the processor to provide a
cause of the multiple-bit error.

25. The article of any of claims 20 to 24 on a carrier medium. 